About the project

After completing the online course on DataCamp, I feel that programming languages share a lot of common ground. R is very similar to Python; some functions work just like the ones in the Python library pandas, since both use a data frame object to store information, so I found it easy to pick up. My impression is that R focuses on data processing, whereas Python is more of a general-purpose language.

My github repository is here.

My course diary web page is here.


R basics

something <- value
# some_comment (just like python)
Cmd/ctrl + Enter # run the current line or selection (RStudio)
+ # Addition
- # Subtraction
* # Multiplication
/ # Division
^ (or **) # Exponentiation
some_function(object)
mean(data_source)
head(students2014, n = a)
# it's good practice to name the argument to avoid confusion
str(data_source)
# in Python, str() instead returns the string representation of an object
"Strongly disagree" = 1
"Strongly agree" = 5
# Boolean values: TRUE or FALSE 
# all upper case
c(2, 3, 4.1, 5) # elements are coerced to a single common type
summary(data_source)
# it will return some important statistical value of the selected data
plot(x, y, "p", main = "some title")
names[1]
# start at 1 not 0
names[1] <- "new_value"
names[c(1, 3)] # select value from these positions, order matters
c(1, 2, 3, 4, 5) # create a vector with c(), not square brackets
1:5
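Collecting the indexing notes above into one runnable snippet (the example values are my own):

```r
# Vectors are created with c(); indexing starts at 1, not 0.
names <- c("Alice", "Bob", "Carol")   # hypothetical values
names[1]                              # "Alice"
names[1] <- "Anna"                    # replace the first element
names[c(1, 3)]                        # "Anna" "Carol" -- order matters
1:5                                   # the sequence 1 2 3 4 5
```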

# manipulate the vector
(1:5)*2

# length of the data
length(data)

# slicing
data[begin:end]
# possible to modify the vector when slicing a dataframe
==  # exactly equal to
!=  # not equal to
<   # less than
>   # greater than
<=  # less or equal to
>=  # greater or equal to
!a  # NOT a
a & b   # a AND b
a | b   # a OR b
subset(data_source, conditions...)
for (counter in vector) {
  commands
  more commands
}
function_name <- function(arg1, arg2 = 5) return(return_value) # arg = default value
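A minimal runnable example of the function syntax above (the function name and values are my own):

```r
# One required argument, one with a default value.
double_plus <- function(x, add = 5) {
  return(x * 2 + add)
}
double_plus(10)           # 25 (uses the default add = 5)
double_plus(10, add = 0)  # 20
```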
variable <- read.table("source_of_data", sep = "\t", header = TRUE) # read data
dim(data) # check dimension
str(data) # check structure
colSums(df) # returns a sum of each column in df
rowSums(df) # returns a sum of each row in df
colMeans(df)    # returns the mean of each column in df
rowMeans(df) # return the mean of each row in df
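A quick self-contained check of the row/column helpers above, on a toy data frame (values are my own):

```r
df <- data.frame(a = c(1, 2), b = c(3, 4))
colSums(df)    # a = 3, b = 7
rowSums(df)    # 4 6
colMeans(df)   # a = 1.5, b = 3.5
rowMeans(df)   # 2 3
```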
select(dataframe, one_of(columns)) # select columns to create a new dataframe
colnames(data)[column_index] <- "new_name" # modify column names
filtered_data <- filter(original_data, conditions) # filter the data with conditions

# If-else statement
if(condition) {
   do something
} else {
   do something else
}

c(1,2,3,4,5) / 2 # scaling vectors

%>% # pipe operator


select(data_source, one_of(vector_of_columns)) # select few columns as new one

colnames(data) # print column names

data_frame <- as.data.frame(scaled) # change data to data frame

sample() # choose random data

mutate() # adding new variables as mutations of the existing ones

joined_data <- inner_join(first_data, second_data, by = common_columns, suffix = c(".first", ".second"))
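For reference, base R's merge() behaves like an inner join by default; a toy sketch (data frames and column names are my own):

```r
first_data  <- data.frame(id = c(1, 2, 3), score = c(10, 20, 30))
second_data <- data.frame(id = c(2, 3, 4), grade = c("B", "A", "C"))
# Only rows whose 'id' appears in both data frames are kept.
joined <- merge(first_data, second_data, by = "id")
joined   # ids 2 and 3
```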

2. Learning Questionnaire Analysis

2.1 Introduction

The original data has 60 variables and 183 observations; most of the questions were answered on a Likert scale from 1 to 5, except for a few background-related variables (age, attitude, points). For the analysis, the variables related to deep, surface, and strategic learning have been scaled using an R script.
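The scaling mentioned above is typically done by averaging each group of Likert items per respondent; a hedged sketch with made-up item names (`D03`, `D07`, `D11` are my own — the actual question codes live in the data-wrangling script):

```r
# Hypothetical deep-learning items, answered on a 1-5 Likert scale.
answers <- data.frame(D03 = c(4, 5), D07 = c(3, 4), D11 = c(5, 3))
deep <- rowMeans(answers)   # one averaged score per respondent
deep                        # 4 4
```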

2.2 Data overview

# Read in the data
new_lrn14 <- as.data.frame(read.csv('data/learning2014.csv'))

A matrix of plots of variables of the data can be drawn as follows, coloured according to gender variable:

p <- ggpairs(new_lrn14, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p

summary(new_lrn14)

Results:

gender       age           attitude          deep            stra            surf           points     
 F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250   Min.   :1.583   Min.   : 7.00  
 M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
         Median :22.00   Median :32.00   Median :3.667   Median :3.188   Median :2.833   Median :23.00  
         Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121   Mean   :2.787   Mean   :22.72  
         3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
         Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000   Max.   :4.333   Max.   :33.00

The gender variable shows that more female than male students attended this course. Most of the variables are only weakly correlated with points; the correlation between points and attitude is the strongest of all. Most variables appear close to normally distributed, except gender and age.

2.3 Fitting a Regression Model

lrn2014_model <- lm(points ~ attitude + stra + surf, data = new_lrn14)
summary(lrn2014_model)
Call:
lm(formula = points ~ attitude + stra + surf, data = new_lrn14)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1550  -3.4346   0.5156   3.6401  10.8952 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.01711    3.68375   2.991  0.00322 ** 
attitude     0.33952    0.05741   5.913 1.93e-08 ***
stra         0.85313    0.54159   1.575  0.11716    
surf        -0.58607    0.80138  -0.731  0.46563    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.296 on 162 degrees of freedom
Multiple R-squared:  0.2074,    Adjusted R-squared:  0.1927 
F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

As the model summary shows, the estimates for strategic and surface learning have large p-values and thus no statistical significance in explaining the course points, which was expected given their weak correlation with the points variable. It therefore makes sense to remove the non-significant variables one at a time, starting with strategic learning.

Remove strategic learning variable:

lrn2014_model <- lm(points ~ attitude + surf, data = new_lrn14)
summary(lrn2014_model)
Call:
lm(formula = points ~ attitude + surf, data = new_lrn14)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.277  -3.236   0.386   3.977  10.642 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.11957    3.12714   4.515 1.21e-05 ***
attitude     0.34264    0.05764   5.944 1.63e-08 ***
surf        -0.77895    0.79556  -0.979    0.329    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.32 on 163 degrees of freedom
Multiple R-squared:  0.1953,    Adjusted R-squared:  0.1854 
F-statistic: 19.78 on 2 and 163 DF,  p-value: 2.041e-08

2.4 Analysis of the Model Summary

With the strategic learning variable removed, the statistical significance of the surface learning estimate improves, so it can be kept in this model. The model is still not very good: the multiple R² is only 0.1953, which means that about 80% of the variation in the dependent variable remains unexplained by the explanatory variables. Any predictions based on the model are therefore not very reliable.
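For reference, the R² quoted above is one minus the ratio of the residual to the total sum of squares; a quick check on toy data (values are my own):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)           # residual sum of squares
tss <- sum((y - mean(y))^2)            # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```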

2.5 Diagnostic Plots and Assumptions

For a linear model, the following assumptions should hold:

  1. The errors are not correlated.
  2. The size of the errors does not depend on the explanatory variables.
  3. The errors of the model should be normally distributed.

The validity of these assumptions can be tested by analysing the residuals of the model. In the following figure, three diagnostic plots are drawn:

  • Residuals vs. fitted values plot
  • Normal Q–Q plot
  • Residuals vs. leverage plot

par(mfrow = c(1, 3))
plot(lrn2014_model, which = c(1, 2, 5))

Interpretations:

  1. The residuals vs. fitted values plot doesn't show any pattern or correlation, so the errors are uncorrelated and their size is independent of the explanatory variables.
  2. The Q–Q plot shows a reasonably good fit between the standardised residuals and the theoretical quantiles, which confirms the normality assumption.
  3. From the third plot we can conclude that the model is not distorted by any single observation, since the x-axis scale of the residuals vs. leverage plot is relatively narrow and no significant outliers were observed.

3. Alcohol Consumption Analysis

3.1 Introduction

3.2 Overview of the Data

alc <- as.data.frame(read.table('data/alc.csv',  sep="\t", header=TRUE))
glimpse(alc)

Results:

Observations: 382
Variables: 35
$ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
$ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M, M, M, M, M, F, F, M, M, M, M,…
$ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 17, 16, 15, 15, 1…
$ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, R, U, U, U, U, U,…
$ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3, LE3, GT3, GT3, GT3, GT3, GT3,…
$ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T, T, T, T, T, T, T, T, T, A, T,…
$ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4, 4, 4, 4, 2, 2, 2, 2, 4, 3, 4,…
$ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3, 3, 4, 2, 2, 4, 2, 2, 2, 4, 4,…
$ Mjob       <fct> at_home, at_home, at_home, health, other, services, other, other, services, other, teache…
$ Fjob       <fct> teacher, other, other, services, other, other, other, teacher, other, other, health, othe…
$ reason     <fct> course, course, other, home, home, reputation, home, home, home, home, reputation, reputa…
$ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
$ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, no, yes…
$ guardian   <fct> mother, father, mother, mother, father, mother, mother, mother, mother, mother, mother, f…
$ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,…
$ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1, 2, 1, 2, 2, 3, 1, 1, 1, 2, 2,…
$ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
$ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, no, no, no, yes, no, no, no, n…
$ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ye…
$ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes, yes, no, no, yes, no, no, yes…
$ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no, no, no, yes, yes, yes, yes, …
$ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
$ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
$ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3, 4, 5, 4, 5, 4, 1, 4, 2, 5, 4,…
$ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1, 4, 4, 5, 4, 3, 2, 2, 2, 3, 4,…
$ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3, 1, 2, 1, 4, 2, 2, 2, 4, 3, 5,…
$ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 5,…
$ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3, 1, 1, 3, 4, 1, 3, 2, 4, 1, 5,…
$ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5, 1, 5, 5, 5, 5, 5, 5, 1, 5, 5,…
$ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5, 0, 0, 1, 1, 2, 10, 5, 2, 3, 1…
$ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16, 13, 10, 7, 10, 12, 12, 14, 12…
$ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16, 14, 12, 6, 11, 14, 14, 14, 1…
$ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 16, 14, 12, 6, 11, 14, 14, 15, …
$ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0, 2.0, 1.5, 1.0, 1.5, 1.5, 1.0,…
$ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…

3.3 Purpose of the Analysis

I’ve chosen the following four variables that I assume are indicative of students’ alcohol consumption:

  • Final grade (variable G3). My assumption is that students with lower grades drink more.
  • Failures (variable failures). My assumption is that students who fail more drink more.
  • Absences from lessons (variable absences). My assumption is that students who skip lessons drink more.
  • Study time (variable studytime). My assumption is that students who spend less time studying spend more time drinking.

A summary and plots of the chosen variables are shown below. The box plots are also grouped by sex to reveal possible correlations.

summary(alc[c('G3','failures','absences','studytime')])
       G3           failures         absences      studytime    
 Min.   : 0.00   Min.   :0.0000   Min.   : 0.0   Min.   :1.000  
 1st Qu.:10.00   1st Qu.:0.0000   1st Qu.: 1.0   1st Qu.:1.000  
 Median :12.00   Median :0.0000   Median : 3.0   Median :2.000  
 Mean   :11.46   Mean   :0.2016   Mean   : 4.5   Mean   :2.037  
 3rd Qu.:14.00   3rd Qu.:0.0000   3rd Qu.: 6.0   3rd Qu.:2.000  
 Max.   :18.00   Max.   :3.0000   Max.   :45.0   Max.   :4.000
# plot 

p1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + xlab('high use') + ylab('final grade')
p2 <- ggplot(alc, aes(failures)) + geom_bar(aes(fill = high_use), position = "dodge", stat = "count") + xlab('failures')
p3 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + xlab('high use') + ylab('absences')
p4 <- ggplot(alc, aes(studytime)) + geom_bar(aes(fill = high_use), position = "dodge", stat = "count") + xlab('study time')

3.4 Logistic Regression Analysis

m <- glm(high_use ~ G3 + failures + absences + studytime, data = alc, family = "binomial")
summary(m)
Call:
glm(formula = high_use ~ G3 + failures + absences + studytime, 
    family = "binomial", data = alc)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1570  -0.8393  -0.6605   1.1103   2.1227  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.02933    0.56149   0.052 0.958346    
G3          -0.03519    0.03868  -0.910 0.362964    
failures     0.29443    0.20422   1.442 0.149383    
absences     0.07766    0.02271   3.420 0.000626 ***
studytime   -0.47423    0.15920  -2.979 0.002893 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 465.68  on 381  degrees of freedom
Residual deviance: 429.15  on 377  degrees of freedom
AIC: 439.15

Number of Fisher Scoring iterations: 4

According to the model summary, my hypotheses regarding the final grade and failures were wrong: those estimates are not statistically significant. The other two explanatory variables are relatively strong predictors. The odds ratios were calculated:

or <- exp(coef(m))
or
(Intercept)          G3    failures    absences   studytime 
  1.0297605   0.9654265   1.3423577   1.0807536   0.6223632 

As the odds ratios show, the grade has virtually no effect on the likelihood of heavy alcohol consumption, and each absence increases the odds only slightly. As for failures and study time: a student who fails is more likely to also drink a lot, while a student who studies a lot is almost half as likely to drink a lot.

Confidence intervals:

ci <- exp(confint(m))
cbind(or, ci)
                   or     2.5 %    97.5 %
(Intercept) 1.0297605 0.3412502 3.1040114
G3          0.9654265 0.8948227 1.0417846
failures    1.3423577 0.9000830 2.0158575
absences    1.0807536 1.0359926 1.1326415
studytime   0.6223632 0.4513693 0.8437505

With regard to the final grade, my hypothesis was wrong: the value one lies almost in the middle of its confidence interval. Thus, before making any actual predictions with the model, it is best to remove that variable:

m <- glm(high_use ~ failures + absences + studytime, data = alc, family = "binomial")
summary(m)
Call:
glm(formula = high_use ~ failures + absences + studytime, family = "binomial", 
    data = alc)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1536  -0.8385  -0.6825   1.1469   2.1604  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.36590    0.35603  -1.028 0.304080    
failures     0.36378    0.19023   1.912 0.055845 .  
absences     0.07897    0.02271   3.477 0.000507 ***
studytime   -0.48619    0.15855  -3.066 0.002166 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 465.68  on 381  degrees of freedom
Residual deviance: 429.97  on 378  degrees of freedom
AIC: 437.97

Number of Fisher Scoring iterations: 4
cbind(exp(coef(m)), exp(confint(m)))
                          2.5 %    97.5 %
(Intercept) 0.6935745 0.3442726 1.3944631
failures    1.4387512 0.9917543 2.1021998
absences    1.0821747 1.0373417 1.1341316
studytime   0.6149678 0.4465274 0.8325472

3.5 Predicting with the Model

# Predict the probability.
prob <- predict(m, type = "response")
# Add the probabilities to alc.
alc <- mutate(alc, probability = prob)
# Calculate a logical high use value based on probabilities.
alc <- mutate(alc, prediction = probability > 0.5)
# Tabulate the target variable versus the predictions,
# with both absolute and proportional numbers.
tbl <- table(high_use = alc$high_use, prediction = alc$prediction)
addmargins(tbl)
        prediction
high_use FALSE TRUE Sum
   FALSE   254   14 268
   TRUE     91   23 114
   Sum     345   37 382
round(addmargins(prop.table(tbl)), 2)
        prediction
high_use FALSE TRUE  Sum
   FALSE  0.66 0.04 0.70
   TRUE   0.24 0.06 0.30
   Sum    0.90 0.10 1.00
high_u <- as.data.frame(prop.table(table(alc$high_use)))
predic <- as.data.frame(prop.table(table(alc$prediction)))
pp1 <- ggplot(high_u, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('observed high use') + theme(legend.position = 'none')
pp2 <- ggplot(predic, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('predicted high use') + theme(legend.position = 'none')
multiplot(pp1, pp2, cols = 2)

mloss <- function(obs, prob) {
  res <- ifelse(prob > 0.5, 1, 0)
  mean(res != obs)
}
round(mloss(obs = alc$high_use, prob = alc$probability), 2)
[1] 0.27

The training error is 27%, so the accuracy of this model is 73%.
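These figures follow directly from the confusion table: the correct predictions are on the diagonal (254 + 23) out of 382 students:

```r
# Counts taken from the cross tabulation of high_use vs. prediction.
correct  <- 254 + 23        # true negatives + true positives
total    <- 382
accuracy <- correct / total
round(accuracy, 2)          # 0.73
round(1 - accuracy, 2)      # training error: 0.27
```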


4. Classification of the Boston Dataset

4.1 Overview

# access the MASS package
library(MASS)

# load the data
data("Boston")

glimpse(Boston)
Observations: 506
Variables: 14
$ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.14455, 0.21124, 0.17004, 0.…
$ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 0.0, 0.0, 0.0, 0.0,…
$ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 7.87, 8.14, 8.14, 8.…
$ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524, 0.524, 0.524, 0.524, 0.524, 0…
$ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631, 6.004, 6.377, 6.009, 5.889, 5…
$ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 94.3, 82.9, 39.0, 61.8, 84.5, 5…
$ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505, 6.0821, 6.5921, 6.3467, 6.22…
$ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 311, 307, 307, 307, 307, 307, 30…
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 15.2, 21.0, 21.0, 21…
$ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90, 386.63, 386.71, 392.52, 396.…
$ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10, 20.45, 13.27, 15.71, 8.26, 1…
$ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15.0, 18.9, 21.7, 20.4, 18.2, 19…
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 10)), upper = list(continuous = wrap("cor", size=3)))
p

summary(Boston)
     crim                zn             indus            chas              nox               rm       
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000   Min.   :0.3850   Min.   :3.561  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000   1st Qu.:0.4490   1st Qu.:5.886  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000   Median :0.5380   Median :6.208  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917   Mean   :0.5547   Mean   :6.285  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000   3rd Qu.:0.6240   3rd Qu.:6.623  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000   Max.   :0.8710   Max.   :8.780  
      age              dis              rad              tax           ptratio          black       
 Min.   :  2.90   Min.   : 1.130   Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 77.50   Median : 3.207   Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 68.57   Mean   : 3.795   Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :100.00   Max.   :12.127   Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  

4.2 Standardising and Categorising

The dataset must be standardised, i.e. each variable is centred and scaled so that its mean is zero and its standard deviation is one. (Standardising does not make a variable normally distributed; it only shifts and rescales it.) This can be done as follows:

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
     crim                 zn               indus              chas              nox         
 Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
 1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
 Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
 Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
       rm               age               dis               rad               tax             ptratio       
 Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
 1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
 Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
 Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
     black             lstat              medv        
 Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
 1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
 Median : 0.3808   Median :-0.1811   Median :-0.1449  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
 Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865  
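As a sanity check, scale() subtracts each column's mean and divides by its standard deviation, which is why every mean in the summary above is exactly zero:

```r
v <- c(2, 4, 6, 8)
s <- scale(v)            # (v - mean(v)) / sd(v)
mean(s)                  # 0
sd(as.vector(s))         # 1
```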

We also need to categorise our target variable – crim – to classify it:

# Create a quantile vector of crim, and use it to create the categorical "crime".
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c('low', 'med_low', 'med_high', 'high'))
# Replace the original unscaled variable.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime) # Explore the categorised variable.
##      low  med_low med_high     high 
##      127      126      126      127

In order to create the LDA model and to test it, the data has to be divided into training and testing sets.

n <- nrow(boston_scaled) # Get number of rows in the dataset.
ind <- sample(n,  size = n * 0.8) # Choose randomly 80% of the rows.
train <- boston_scaled[ind,] # Create train set.
test <- boston_scaled[-ind,] # Create test set.
# Save the correct classes from the test data.
correct_classes <- test$crime
# Remove the crime variable from the test data.
test <- dplyr::select(test, -crime)

4.3 Fitting the Model

lda.fit <- lda(crime ~ ., data = train)
lda.fit
Call:
lda(crime ~ ., data = train)

Prior probabilities of groups:
      low   med_low  med_high      high 
0.2475248 0.2549505 0.2500000 0.2475248 

Group means:
                 zn       indus         chas        nox         rm        age        dis        rad        tax
low       0.8788254 -0.90299708 -0.193587063 -0.8578620  0.4283921 -0.8810268  0.8563773 -0.6993483 -0.7586291
med_low  -0.1000971 -0.32495455 -0.004759149 -0.5695319 -0.1409065 -0.3721597  0.3648300 -0.5470148 -0.4791331
med_high -0.3751653  0.09765591  0.234426408  0.3247534  0.1474674  0.3919174 -0.3256314 -0.3974038 -0.3377904
high     -0.4872402  1.01715195 -0.075474056  1.0130937 -0.3741645  0.7948136 -0.8597796  1.6377820  1.5138081
             ptratio       black       lstat         medv
low      -0.42566133  0.38009442 -0.75226396  0.528774452
med_low  -0.09874892  0.31131227 -0.12813291  0.003927945
med_high -0.28221456  0.09962635 -0.03927833  0.188486538
high      0.78037363 -0.77500232  0.89169331 -0.705092197

Coefficients of linear discriminants:
                LD1         LD2         LD3
zn       0.15327760  0.72696090 -0.96918821
indus    0.06895212 -0.23203144  0.19852525
chas    -0.08674273 -0.16587825  0.09866070
nox      0.29905137 -0.69375411 -1.34777143
rm      -0.08073322 -0.13239391 -0.18618294
age      0.35186877 -0.34166351 -0.30960892
dis     -0.04116222 -0.25001198  0.05134464
rad      3.22140261  0.98659875 -0.38833796
tax     -0.01928876 -0.13190509  0.89928951
ptratio  0.11143195  0.02470684 -0.30292304
black   -0.12568460  0.01433414  0.08414446
lstat    0.21707935 -0.13116253  0.54061068
medv     0.17787842 -0.28268695 -0.07678544

Proportion of trace:
   LD1    LD2    LD3 
0.9527 0.0337 0.0136

The output shows that we have three linear discriminants. The first explains vast majority – 95.27% – of the between-group variance.

# Define a function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink", tex = 1, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime) # Turn the classes to numeric for plotting.
plot(lda.fit, dimen = 2, col = classes, pch = classes) # Plot.
lda.arrows(lda.fit) # Add arrows.

4.4 Predicting

lda.pred <- predict(lda.fit, newdata = test) # Predict the test values.
# Cross tabulate the predictions with the correct values.
table(correct = correct_classes, predicted = lda.pred$class)
          predicted
correct    low med_low med_high high
  low       15       6        3    0
  med_low    3      18        4    0
  med_high   0      15       15    2
  high       0       0        0   21

As the table shows, the model did not predict the “med_high” category well. The model can thus be used for crude predictions, but it is hardly perfect.

4.5 Clustering

boston_scaled <- as.data.frame(scale(Boston)) # Standardise the data.
dist_eu <- dist(boston_scaled) # Create a Euclidean distance matrix.
summary(dist_eu) # Summarise the matrix.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1343  3.4625  4.8241  4.9111  6.1863 14.3970 

We can try to cluster the data with k-means right away. We used four classes for our LDA model, so we might start with the same number of clusters:

km <- kmeans(dist_eu, centers = 4) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.

However, while the results look somewhat reasonable, the number of clusters was merely a guess. To determine it properly, the total within-cluster sum of squares (TWCSS) should be calculated. Let’s try it with a maximum of 15 clusters:

k_max <- 15 # Maximum number of clusters to try.
# Define a function for testing.
k_try <- function(k) {
  kmeans(dist_eu, k)$tot.withinss
}
# Calculate the total within sum of squares using the function.
twcss <- sapply(1:k_max, k_try)

# Visualize the results.
plot(1:k_max, twcss, type='b')

For comparison, re-cluster the data with just two clusters:

km <- kmeans(dist_eu, centers = 2) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.

The plot shows less overlap than with four clusters, which suggests that the optimal number of clusters is two.


5. Tea and Human Development

5.1 Data overview

# Read in the data
human <- as.data.frame(read.table('data/human.csv',  sep="\t", header=TRUE))
glimpse(human)
Observations: 155
Variables: 8
$ se_f_of_m  <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.9690608, 0.9927835, 1.0241730, 1.0031646, 1…
$ lfp_f_of_m <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.8286119, 0.8072289, 0.7797357, 0.8171263, 0…
$ edu_exp    <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9, 19.2, 15.4, 15.8, 16.2, 19.0, 16.9,…
$ life_exp   <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0, 81.8, 83.0, 82.2, 80.7, 82.6, 81.9,…
$ gni_cap    <dbl> 64992, 42261, 56431, 44025, 45435, 43919, 39568, 52947, 42155, 32689, 76628, 45636, 39267…
$ mmr        <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 11, 6, 6, 12, 4, 4, 7, 4, 4, 5, 5, 11,…
$ abr        <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3, 6.0, 6.5, 25.8, 11.5, 2.2, 7.8, 8.3…
$ mp_share   <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2, 31.4, 25.3, 43.6, 23.5, 41.3, 16.3,…

The output shows 155 observations (countries) and 8 variables, described below:

  • se_f_of_m = ratio of Female and Male populations with secondary education in each country (i.e. edu2F / edu2M).
  • lfp_f_of_m = ratio of labour force participation of females and males in each country (i.e. labF / labM).
  • edu_exp = expected years of education
  • life_exp = life expectancy
  • gni_cap = gross national income (GNI) per capita (dollars, purchasing power parity)
  • mmr = maternal mortality rate
  • abr = adolescent birth rate
  • mp_share = share of female representatives in the national parliament
ggpairs(human)

summary(human)
  se_f_of_m        lfp_f_of_m        edu_exp         life_exp        gni_cap            mmr        
 Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00   Min.   :   581   Min.   :   1.0  
 1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30   1st Qu.:  4198   1st Qu.:  11.5  
 Median :0.9375   Median :0.7535   Median :13.50   Median :74.20   Median : 12040   Median :  49.0  
 Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65   Mean   : 17628   Mean   : 149.1  
 3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25   3rd Qu.: 24512   3rd Qu.: 190.0  
 Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50   Max.   :123124   Max.   :1100.0  
      abr            mp_share    
 Min.   :  0.60   Min.   : 0.00  
 1st Qu.: 12.65   1st Qu.:12.40  
 Median : 33.60   Median :19.30  
 Mean   : 47.16   Mean   :20.91  
 3rd Qu.: 71.95   3rd Qu.:27.95  
 Max.   :204.80   Max.   :57.50  
corrplot(round(cor(human), digits = 2), method = "circle", type = "upper", tl.pos = "d", tl.cex = 0.8)

The plot shows that the strongest correlation of all is the strong negative correlation between life expectancy (life_exp) and maternal mortality rate (mmr). The next strongest are the positive correlations between life expectancy (life_exp) and expected years of education (edu_exp), and between maternal mortality rate (mmr) and adolescent birth rate (abr).
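The same negative relationship can be checked numerically with base R's cor(); a toy sketch with illustrative values, not actual rows of the data:

```r
# Illustrative values only: high life expectancy pairs with low
# maternal mortality, so Pearson's r comes out negative.
life_exp_toy <- c(81.6, 82.4, 83.0, 66.3, 49.0)
mmr_toy      <- c(4, 6, 6, 190, 1100)
cor(life_exp_toy, mmr_toy)  # negative, as the corrplot suggests
```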

5.2 PCA

pca_human <- prcomp(human)
s_human_nonstd <- summary(pca_human)
s_human_nonstd
Importance of components:
                             PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# Percentages of variance for the plot titles.
pr_shns <- round(100*s_human_nonstd$importance[2, ], digits = 1)
pc_shns_lab <- paste0(names(pr_shns), " (", pr_shns, "%)")
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_shns_lab[1], ylab = pc_shns_lab[2])
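Note that prcomp() is run here on non-standardised data, so PC1 is dominated by gni_cap, whose variance dwarfs the other variables'. A toy base-R sketch of the effect (the variable names are made up):

```r
# One high-variance and one low-variance column: without scaling,
# PC1 aligns almost entirely with the high-variance column.
set.seed(1)
toy <- data.frame(big = rnorm(100, sd = 1000), small = rnorm(100, sd = 1))
prop_pc1 <- summary(prcomp(toy))$importance[2, 1]  # proportion of variance on PC1
prop_pc1  # close to 1
```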

human_std <- scale(human) # Standardise the variables.
pca_human_std <- prcomp(human_std)
s_human_std <- summary(pca_human_std)
s_human_std
pr_shs <- round(100*s_human_std$importance[2, ], digits = 1)
pc_shs_lab <- paste0(names(pr_shs), " (", pr_shs, "%)")
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_shs_lab[1], ylab = pc_shs_lab[2])
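scale() centres each column to mean 0 and divides it by its standard deviation, which is what puts the variables on an equal footing for the PCA; a minimal check on a toy vector:

```r
x <- c(2, 4, 6)       # mean 4, sd 2
as.numeric(scale(x))  # (x - mean(x)) / sd(x) = -1, 0, 1
```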

5.3 Interpreting the PCA Results

The 1st principal component (PC1) explains 53.6% of the total variance of the original 8 variables, and the 2nd principal component (PC2) explains 16.2%.

From the plot, we can conclude that PC1 represents variables related mostly to poverty and health, while PC2 relates mostly to gender equality.

It's safe to say that most of the total variance in the data comes from poverty and health, but gender equality also explains some of it.

From the arrows, we can see that high maternal mortality rate (mmr) and adolescent birth rate (abr) correlate strongly with the poverty dimension (PC1), while high life expectancy (life_exp), expected years of education (edu_exp), ratio of females with secondary education (se_f_of_m) and GNI per capita (gni_cap) correlate strongly negatively with it.

Further, we can see that a high share of female MPs (mp_share) and a high ratio of female labour force participation (lfp_f_of_m) have a strong positive correlation with the gender-equality dimension (PC2).

5.4 Tea data analysis

data("tea")
glimpse(tea)
Observations: 300
Variables: 36
$ breakfast        <fct> breakfast, breakfast, Not.breakfast, Not.breakfast, breakfast, Not.breakfast, break…
$ tea.time         <fct> Not.tea time, Not.tea time, tea time, Not.tea time, Not.tea time, Not.tea time, tea…
$ evening          <fct> Not.evening, Not.evening, evening, Not.evening, evening, Not.evening, Not.evening, …
$ lunch            <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lu…
$ dinner           <fct> Not.dinner, Not.dinner, dinner, dinner, Not.dinner, dinner, Not.dinner, Not.dinner,…
$ always           <fct> Not.always, Not.always, Not.always, Not.always, always, Not.always, Not.always, Not…
$ home             <fct> home, home, home, home, home, home, home, home, home, home, home, home, home, home,…
$ work             <fct> Not.work, Not.work, work, Not.work, Not.work, Not.work, Not.work, Not.work, Not.wor…
$ tearoom          <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, Not.t…
$ friends          <fct> Not.friends, Not.friends, friends, Not.friends, Not.friends, Not.friends, friends, …
$ resto            <fct> Not.resto, Not.resto, resto, Not.resto, Not.resto, Not.resto, Not.resto, Not.resto,…
$ pub              <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, No…
$ Tea              <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl Grey, Earl Grey, black, Earl Gr…
$ How              <fct> alone, milk, alone, alone, alone, alone, alone, milk, milk, alone, alone, alone, mi…
$ sugar            <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar, No.sugar, No.sugar, No.sugar,…
$ how              <fct> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag, tea bag, tea bag, tea bag+unp…
$ where            <fct> chain store, chain store, chain store, chain store, chain store, chain store, chain…
$ price            <fct> p_unknown, p_variable, p_variable, p_variable, p_variable, p_private label, p_varia…
$ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 31, 56, 66, 65, 60, 35, 72, 73, 80, 76,…
$ sex              <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, M, M, F, F, F, F, F, F, F, F, F, F, F,…
$ SPC              <fct> middle, middle, other worker, student, employee, student, senior, middle, senior, s…
$ Sport            <fct> sportsman, sportsman, sportsman, Not.sportsman, sportsman, sportsman, sportsman, sp…
$ age_Q            <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-44, 35-44, 35-44, 35-44, 25-34, 25-34,…
$ frequency        <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3 to 6/week, 1 to 2/week, +2/day, +2/da…
$ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.escape-exoticism, escape-exoticism, esc…
$ spirituality     <fct> Not.spirituality, Not.spirituality, Not.spirituality, spirituality, spirituality, N…
$ healthy          <fct> healthy, healthy, healthy, healthy, Not.healthy, healthy, healthy, healthy, Not.hea…
$ diuretic         <fct> Not.diuretic, diuretic, diuretic, Not.diuretic, diuretic, Not.diuretic, Not.diureti…
$ friendliness     <fct> Not.friendliness, Not.friendliness, friendliness, Not.friendliness, friendliness, N…
$ iron.absorption  <fct> Not.iron absorption, Not.iron absorption, Not.iron absorption, Not.iron absorption,…
$ feminine         <fct> Not.feminine, Not.feminine, Not.feminine, Not.feminine, Not.feminine, Not.feminine,…
$ sophisticated    <fct> Not.sophisticated, Not.sophisticated, Not.sophisticated, sophisticated, Not.sophist…
$ slimming         <fct> No.slimming, No.slimming, No.slimming, No.slimming, No.slimming, No.slimming, No.sl…
$ exciting         <fct> No.exciting, exciting, No.exciting, No.exciting, No.exciting, No.exciting, No.excit…
$ relaxing         <fct> No.relaxing, No.relaxing, relaxing, relaxing, relaxing, relaxing, relaxing, relaxin…
$ effect.on.health <fct> No.effect on health, No.effect on health, No.effect on health, No.effect on health,…
cha <- dplyr::select(tea, one_of(c("Tea", "How", "sugar", "where", "healthy", "sex"))) # keep six variables of interest
glimpse(cha)
Observations: 300
Variables: 6
$ Tea     <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl Grey, Earl Grey, black, Earl Grey, black…
$ How     <fct> alone, milk, alone, alone, alone, alone, alone, milk, milk, alone, alone, alone, milk, milk,…
$ sugar   <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar, No.sugar, No.sugar, No.sugar, No.sugar…
$ where   <fct> chain store, chain store, chain store, chain store, chain store, chain store, chain store, c…
$ healthy <fct> healthy, healthy, healthy, healthy, Not.healthy, healthy, healthy, healthy, Not.healthy, hea…
$ sex     <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, M, M, F, F, F, F, F, F, F, F, F, F, F, F, F, F,…
ggplot(gather(cha), aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
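tidyr::gather() still works but is superseded in tidyr ≥ 1.0; a sketch of the same plot with pivot_longer() (assuming dplyr ≥ 1.0 and the cha data frame from above), converting the factor columns to character first so their differing level sets can be stacked:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

cha_long <- cha %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "key", values_to = "value")

ggplot(cha_long, aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
```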

cha_mca <- MCA(cha, graph = FALSE)
summary(cha_mca)

Call:
MCA(X = cha, graph = FALSE) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7   Dim.8   Dim.9  Dim.10
Variance               0.234   0.220   0.212   0.193   0.169   0.159   0.139   0.124   0.119   0.096
% of var.             14.063  13.193  12.749  11.603  10.119   9.545   8.356   7.462   7.141   5.769
Cumulative % of var.  14.063  27.256  40.005  51.608  61.727  71.271  79.627  87.090  94.231 100.000

Individuals (the 10 first)
               Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2  
1           | -0.104  0.015  0.009 | -0.165  0.041  0.023 |  0.325  0.165  0.089 |
2           |  0.577  0.473  0.212 | -0.259  0.102  0.043 |  0.047  0.003  0.001 |
3           |  0.180  0.046  0.052 | -0.255  0.099  0.106 | -0.568  0.507  0.523 |
4           | -0.469  0.313  0.286 |  0.097  0.014  0.012 | -0.019  0.001  0.000 |
5           | -0.388  0.215  0.142 | -0.150  0.034  0.021 | -0.257  0.104  0.062 |
6           | -0.089  0.011  0.011 | -0.261  0.103  0.091 | -0.126  0.025  0.021 |
7           | -0.089  0.011  0.011 | -0.261  0.103  0.091 | -0.126  0.025  0.021 |
8           |  0.577  0.473  0.212 | -0.259  0.102  0.043 |  0.047  0.003  0.001 |
9           |  0.007  0.000  0.000 |  0.486  0.357  0.119 |  0.199  0.062  0.020 |
10          |  0.640  0.583  0.266 | -0.145  0.032  0.014 |  0.403  0.255  0.105 |

Categories (the 10 first)
                Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test     Dim.3     ctr    cos2
black       |   0.826  11.979   0.224   8.177 |  -0.320   1.919   0.034  -3.170 |   0.608   7.164   0.121
Earl Grey   |  -0.235   2.536   0.100  -5.468 |   0.415   8.395   0.311   9.636 |  -0.342   5.893   0.211
green       |  -0.476   1.774   0.028  -2.895 |  -1.708  24.331   0.361 -10.385 |   0.634   3.469   0.050
alone       |  -0.117   0.638   0.026  -2.768 |  -0.357   6.285   0.237  -8.416 |  -0.364   6.768   0.247
lemon       |  -0.129   0.130   0.002  -0.782 |   1.152  11.064   0.164   7.003 |   1.112  10.662   0.153
milk        |  -0.026   0.010   0.000  -0.235 |   0.369   2.163   0.036   3.286 |   0.386   2.455   0.040
other       |   3.202  21.870   0.317   9.737 |   0.934   1.985   0.027   2.841 |   1.116   2.931   0.039
No.sugar    |   0.534  10.463   0.304   9.541 |  -0.486   9.248   0.252  -8.688 |  -0.142   0.823   0.022
sugar       |  -0.570  11.185   0.304  -9.541 |   0.519   9.885   0.252   8.688 |   0.152   0.879   0.022
chain store |  -0.237   2.562   0.100  -5.470 |  -0.203   2.001   0.073  -4.682 |  -0.335   5.622   0.199
             v.test  
black         6.021 |
Earl Grey    -7.936 |
green         3.855 |
alone        -8.586 |
lemon         6.758 |
milk          3.442 |
other         3.394 |
No.sugar     -2.547 |
sugar         2.547 |
chain store  -7.715 |

Categorical variables (eta2)
              Dim.1 Dim.2 Dim.3  
Tea         | 0.229 0.457 0.211 |
How         | 0.318 0.284 0.291 |
sugar       | 0.304 0.252 0.022 |
where       | 0.248 0.306 0.362 |
healthy     | 0.159 0.020 0.028 |
sex         | 0.147 0.000 0.362 |
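The "% of var." row above is just each eigenvalue divided by the total inertia; for an MCA the total inertia equals (K - Q)/Q, where K is the number of categories (16 here) and Q the number of variables (6). A base-R check against the printed values:

```r
# Eigenvalues as printed above (rounded to three decimals).
eig <- c(0.234, 0.220, 0.212, 0.193, 0.169, 0.159, 0.139, 0.124, 0.119, 0.096)

round(100 * eig / sum(eig), 1)  # approximately the '% of var.' row
sum(eig)                        # close to (16 - 6) / 6 = 1.667
```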

Drawing the variable biplot, the individuals plot and the categories plot of the analysis:

par(mfrow = c(1,3)) # Set some graphical params.
plot(cha_mca, choix = "var", title = "MCA variables") # The variable biplot.
plot(cha_mca, choix = "ind", invisible = "var") # The individuals plot.
plot(cha_mca, choix = "ind", invisible = "ind") # The categories plot.

As can be seen from the leftmost plot, the strongest link is between the Tea variable (i.e. the type of tea) and the second dimension.

The plots in the center and on the right show no clear pattern in either of them.